-
Notifications
You must be signed in to change notification settings - Fork 358
[AMORO-3775] Add support for metric-based refresh event trigger in TableRuntimeRefreshExecutor #3776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[AMORO-3775] Add support for metric-based refresh event trigger in TableRuntimeRefreshExecutor #3776
Conversation
cd57764 to
c8734fb
Compare
c352653 to
f00825b
Compare
7782ac7 to
ab4b971
Compare
…meRefreshExecutor
ab4b971 to
519d183
Compare
Sure, I've updated the branch and added the new evaluation criteria discussed earlier (see Step 1 for details). The current conditions for triggering pendingInput evaluation based on metrics are as follows: Note that this update now supports MIX_ICEBERG tables, whereas previously only ICEBERG format was supported. Please take a look when you are free. Looking forward to your feedback! @xxubai @zhoujinsong @klion26 |
| "self-optimizing.evaluation.average-file-size.tolerance"; // the minimum tolerance value for | ||
| // the average | ||
| // partition file size (between 0 and (self-optimizing.target-size)) | ||
| public static final MemorySize SELF_OPTIMIZING_EVALUATION_AVERAGE_FILE_SIZE_TOLERANCE_DEFAULT = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use byte size to unify the file size unit?
|
|
||
| CloseableIterable<PartitionFileBaseInfo> tableFiles = | ||
| getTableFilesInternal(amoroTable, null, null); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use try catch with resource to close the io automaticly
| try { | |
| try (CloseableIterable<PartitionFileBaseInfo> tableFiles | |
| = getTableFilesInternal(amoroTable, null, null)) { | |
| for (PartitionFileBaseInfo fileInfo : tableFiles) { | |
| refreshPartitionBasicInfo(fileInfo, partitionBaseInfoHashMap); | |
| } | |
| } catch (IOException e) { | |
| LOG.warn("Failed to close the manifest reader.", e); | |
| } |
| MixedTable table, long minTargetSize) { | ||
| Map<String, PartitionBaseInfo> partitionBaseInfoHashMap = new HashMap<>(); | ||
| CloseableIterable<PartitionFileBaseInfo> tableFiles = getTableFilesInternal(table, null, null); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simply use a try-with-resources statement.
| return true; | ||
| } | ||
|
|
||
| ExecutorService executorService = ThreadPools.getWorkerPool(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a dedicated thread pool to avoid thread congestion.
| ExecutorService executorService = ThreadPools.getWorkerPool(); | |
| ExecutorService executorService = IcebergThreadPools.getPlanningExecutor(); |
| return getTableFilesInternal(mixedTable, partition, specId); | ||
| } | ||
|
|
||
| private CloseableIterable<PartitionFileBaseInfo> getTableFilesInternal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, after the event-triggered evaluation, a full table scan is performed to collect partition information, which can be very expensive (especially for large tables with hundreds of thousands of files). Perhaps we can optimize this part when upgrading the Iceberg version and introducing PartitionStatistics.
| package org.apache.amoro.table; | ||
|
|
||
| /** Detailed table partition properties list. */ | ||
| public class TablePartitionDetailProperties { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simply this name? such as PartitionSummaryProperties
| && lastOptimizedSnapshotId != defaultTableRuntime.getCurrentSnapshotId())) { | ||
| tryEvaluatingPendingInput(defaultTableRuntime, mixedTable); | ||
| if (!defaultTableRuntime.getOptimizingConfig().isEventBasedTriggerEnabled() | ||
| || MetricBasedRefreshEvent.isEvaluatingPendingInputNecessary( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this cause an additional full table scan compared to before?
In addition, we should also check whether optimization is enabled, so it would be better to combine this with tryEvaluatingPendingInput to avoid extra overhead.
Why are the changes needed?
Close #3775.
Brief change log
Add support for MSE based refresh event:
How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before making a pull request
Documentation